stop rollouts on incomplete responses (no content or tools)#948

Open
willccbb wants to merge 5 commits into main from will/incomplete-responses

Conversation


@willccbb willccbb commented Feb 21, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes core rollout termination/truncation behavior in MultiTurnEnv, which can alter evaluation/training outcomes when providers emit empty responses.

Overview
Multi-turn rollouts now terminate when the model returns an “incomplete” response (no message content and no tool calls). This is implemented as a new @vf.stop condition (has_incomplete_response) and by marking such trajectory steps as truncated in MultiTurnEnv.add_model_response.
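The described stop condition can be sketched roughly as follows. This is a minimal illustration of the idea only; the state shape, field names, and the `has_incomplete_response` signature here are assumptions for illustration, not the repo's actual API:

```python
# Hypothetical sketch: a response is "incomplete" when the last model
# turn has neither text content nor tool calls, in which case the
# multi-turn rollout should stop.
def has_incomplete_response(state: dict) -> bool:
    responses = state.get("responses") or []
    if not responses:
        return False
    last = responses[-1]
    no_content = not last.get("content")      # None or empty string
    no_tool_calls = not last.get("tool_calls")  # None or empty list
    return no_content and no_tool_calls
```

A rollout loop would check this predicate after each model turn and mark the step truncated when it fires, rather than feeding an empty message back into the conversation.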

The docs are updated to mention incomplete-response detection as a default stop condition; the wiki-search environment now strips <think> content before LLM judging; dataset-builder fields in Environment are explicitly cast() for type safety; and .gitignore now ignores packages/tasksets and packages/harnesses.

Written by Cursor Bugbot for commit 92a8da8. This will update automatically on new commits.

@willccbb willccbb requested a review from eligotts February 21, 2026 03:02
…ns list

Co-authored-by: will brown <willccbb@users.noreply.github.com>

@mikasenghaas mikasenghaas left a comment


Why do we handle this as a stop condition and not a vf.Error? (This is what we currently do in openai_chat_completions_client.py.)


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


judge_response = await judge(prompt, completion, answer, state)
cleaned_completion = [
    {x["role"]: x["content"].split("</think>")[-1] for x in completion}
]


Dict comprehension creates wrong message structure for judge

High Severity

The dict comprehension {x["role"]: x["content"].split("</think>")[-1] for x in completion} creates a single dictionary with role names (e.g., "assistant", "tool") as keys and cleaned content as values, wrapped in a list. This produces a structure like [{"assistant": "...", "tool": "..."}] instead of the expected list of message dicts with "role" and "content" keys. When the judge's parse_answer tries to find assistant messages, it looks for a "role" key in each element — which doesn't exist in this dict — so it always returns None, making the judge evaluate against a None response. The brackets likely need to be moved so the list comprehension wraps each message individually.
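A minimal sketch of the fix described above: moving the braces inside the list comprehension so each message keeps its own {"role": ..., "content": ...} dict. The sample messages here are illustrative, not from the PR:

```python
# Buggy version produced [{"assistant": "...", "tool": "..."}]: one dict
# keyed by role. The intended shape is one dict PER message, each with
# "role" and "content" keys, so role-based lookups still work.
completion = [
    {"role": "assistant", "content": "<think>plan</think>Paris"},
    {"role": "tool", "content": "search results"},
]

cleaned_completion = [
    {"role": x["role"], "content": x["content"].split("</think>")[-1]}
    for x in completion
]
# Each element still carries a "role" key, so a parse_answer-style
# search for assistant messages finds them.
```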


async def judge_reward_func(judge, prompt, completion, answer, state) -> float:
    judge_response = await judge(prompt, completion, answer, state)
    cleaned_completion = [
        {x["role"]: x["content"].split("</think>")[-1] for x in completion}


Split on None content causes AttributeError

Medium Severity

AssistantMessage.content is typed as MessageContent | None and defaults to None for tool-call-only messages. The expression x["content"].split("</think>") will raise an AttributeError when content is None. In a multi-turn tool-use environment like wiki_search, assistant messages with only tool_calls and no content are common. The error is silently caught by _call_individual_reward_func, returning a reward of 0.0, which silently corrupts training signal.
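A None-safe variant of the cleanup, sketched under the assumption that tool-call-only assistant messages carry content == None (the messages below are illustrative):

```python
# Guard against None before calling .split(): assistant turns that only
# make tool calls have no text content, so coerce None to "" first.
completion = [
    {"role": "assistant", "content": None},                 # tool-call-only turn
    {"role": "assistant", "content": "<think>x</think>done"},
]

cleaned_completion = [
    {"role": x["role"], "content": (x["content"] or "").split("</think>")[-1]}
    for x in completion
]
```

With the guard in place, the reward function no longer raises (and thus no longer silently returns 0.0) on tool-call-only turns.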



4 participants